================================================================================
PROGRAMMERS' NOTES                                      80 columns, tab=4 spaces
for
    anal  analall  bpdagg  CSequence  CPeakList  CTrace  CTraceFile  FileFormat
    autoseq  xlate  RIncludes  RInlines  DNA

This document describes conventions and nomenclature for the sources of the
programs mentioned above.

This package comprises the sources developed for the manipulation and analysis
of chromatogram data generated by automated sequencing of DNA. The project was
undertaken for the requirements of a Master's Degree in Computer Science and a
Bioengineering Certificate at Washington University in St. Louis, Missouri,
USA. The thesis advisor was David States.

Everything described herein has been released to the public domain. Bug
reports, bug fixes, comments, suggestions, extensions, and the like are
desired.

Contact Addresses:
    Reece Hart              David States
    reece@ibc.wustl.edu     states@ibc.wustl.edu
================================================================================

CONTENTS
--------
    PROGRAM DESCRIPTIONS
    SOURCE DESCRIPTIONS
    CONVENTIONS
    KNOWN PROBLEMS, QUIRKS, AND IMPROVEMENT SUGGESTIONS


PROGRAM DESCRIPTIONS
--------------------
analyze - Runs a complete analysis suite on a named abi file. See script header
    for usage.

analall - Runs anal on all abi files found in the current directory. See script
    header for usage.

bpdagg - Aggregates bpd files for individual bases into a single list sorted by
    base position.

autoseq - An interface to the above modules. It performs essentially no
    calculations itself, but instead directs the classes to perform the
    actions themselves.

xlate - Converts ABI to SCF. The only advantage of this over makeSCF is that
    the input format may be specified as ABI0, which is the /raw/ data
    obtained from the sequencer. This will probably not be of general interest.


SOURCE DESCRIPTIONS
-------------------
All source was written in C++ using AT&T C++ 3.01.
I have noted errors during compilation with g++, but have not yet attempted to
correct them.

CSequence - A simple bidirectional linearly-linked list template. It supports
    essentially any data type.

CPeakList - Defines the PeakRec structure (class) and some simple methods.
    CPeakList is built on a CSequence and implements many methods for the
    manipulation and analysis of a collection of peaks.

CTrace - A template class which stores a large sequence (array) of any
    numerical type. It performs many statistical and analytical functions
    such as derivatives (returned as a CTrace, from which subsequent
    derivatives may be obtained), peak picking, scaling, translating, and I/O.

CTraceFile - Assembles a collection of CTraces and a number of other data
    members which together represent any of several formats of chromatograms
    from automated DNA sequencing experiments. It currently supports reading
    and writing Standard Chromatogram Format (SCF) files, and reading any of
    the data sets within an Applied Biosystems, Inc. (ABI) file.

FileFormat - Simple routines for the determination and description of
    chromatogram file formats.

RIncludes - A set of common definitions, typedefs, etc.

RInlines - A set of useful inline routines.

DNA - Some simple DNA definitions and types.


CONVENTIONS
-----------
* I've tried to provide a consistent coding style, and this style relies
  heavily on tab = 4 spaces.


KNOWN PROBLEMS, QUIRKS, AND IMPROVEMENT SUGGESTIONS
---------------------------------------------------
* The baseline command in autoseq is ambiguous: it actually /translates/ the
  data. There should be separate baseline and translation flags.

* Assimilation of the peaks may have a problem because it inherits the peak
  records from the individual traces. That is, it may be the case that two
  lists point to the same PeakRec. This requires some investigation.

* For a series of peaks pairwise separated by less than some minimum
  separation, exactly one peak is chosen.
  For series which span a region in which more than one real peak exists,
  some peaks will be discarded. Therefore, there's a balance between
  parameters which result in abundant peaks (thus resulting in a large series
  of peaks in close pairwise proximity) and minimum separation criteria which
  prune peaks with 'reasonable' separation. For small minSeparation arguments
  (i.e., <=5), this generally isn't a problem and only the peaks which result
  from noise are tossed (as was originally desired).

* In several cases, I've not made a new class where I probably should have.
  For instance (ahem), I use the same CPeakList for both the peaks of
  individual traces and the assimilated list. However, the assimilated list
  really has no need for the statistical methods (in fact, their application
  to this list would be meaningless). Pruning peaks should not be a tracefile
  function; however, this was necessary because the assimilated list needs to
  know about all 4 of the individual traces' peak lists. Thus, these classes
  are really not as absolutely modular as they could/should be.

* The ted source has sparse references to a 'bottom' variable. I've assumed
  that the only effect this has is to invert the trace and edit the
  reverse-complement of the sequence. I'm not aware of any other effects this
  has on the trace data.

* Class hierarchy
  The current hierarchy is quite simple and is described above. It has worked
  well for prototyping this system and is functional even for non-prototyping
  purposes. However, I believe that a more abstract interpretation of the
  types is now appropriate (and fairly easily done with the sources provided).

    CSequence
        more complete list operators and iterators (sort, doforeach, etc.)
    CArray
        sampling data
        statistical functions
        sorting
        histograms
        derivatives
        peak picking
    CTrace: CArray
    CTraceSet
        collection of CTraces
        orthogonalization
        group calls to CArray methods (i.e.,
        CalcStats, PickPeaks)
        resolve peaks
    CTraceFile: CTraceSet
        reading and writing tracefiles

  Peaks could be stored in a CPeakList as is done currently, or in a
  CSequence<>. Different peak recs should be used for trace v. set peaks.

  Copy constructors for each class.
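
  The proposed hierarchy above might be sketched roughly as follows. This is
  only an illustration in present-day C++, not code from the package: every
  member name and signature here (Size, Mean, Derivative, Base, Add, Trace,
  Means) is hypothetical, std::vector stands in for whatever storage the real
  classes use, and only a mean and a first difference are shown in place of
  the full set of statistics. CTraceFile is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch only: none of these signatures come from the sources.

// CArray: sampled data plus the statistical/analytical methods.
template <typename T>
class CArray {
public:
    CArray() = default;
    explicit CArray(std::vector<T> samples) : data_(std::move(samples)) {}
    CArray(const CArray&) = default;             // copy constructor, per the notes
    CArray& operator=(const CArray&) = default;
    virtual ~CArray() = default;

    std::size_t Size() const { return data_.size(); }

    const T& operator[](std::size_t i) const {
        assert(i < data_.size());
        return data_[i];
    }

    // Arithmetic mean of the samples; 0.0 for an empty array.
    double Mean() const {
        if (data_.empty()) return 0.0;
        double sum = 0.0;
        for (const T& v : data_) sum += static_cast<double>(v);
        return sum / static_cast<double>(data_.size());
    }

    // First difference. Returned as a CArray so that subsequent
    // derivatives may be taken from the result. Assumes a signed T.
    CArray Derivative() const {
        std::vector<T> d;
        for (std::size_t i = 1; i < data_.size(); ++i)
            d.push_back(data_[i] - data_[i - 1]);
        return CArray(std::move(d));
    }

private:
    std::vector<T> data_;
};

// CTrace: CArray. Adds only what is specific to a sequencing trace
// (here, just the base the trace belongs to).
template <typename T>
class CTrace : public CArray<T> {
public:
    CTrace() = default;
    CTrace(char base, std::vector<T> samples)
        : CArray<T>(std::move(samples)), base_(base) {}

    char Base() const { return base_; }

private:
    char base_ = 'N';
};

// CTraceSet: a collection of CTraces with group calls to CArray methods.
template <typename T>
class CTraceSet {
public:
    // Stores a copy, so the set never shares state with the caller's trace.
    void Add(const CTrace<T>& t) { traces_.push_back(t); }

    std::size_t Count() const { return traces_.size(); }

    const CTrace<T>& Trace(std::size_t i) const {
        assert(i < traces_.size());
        return traces_[i];
    }

    // Group call: forwards Mean() to every trace in the set.
    std::vector<double> Means() const {
        std::vector<double> out;
        for (const CTrace<T>& t : traces_) out.push_back(t.Mean());
        return out;
    }

private:
    std::vector<CTrace<T>> traces_;
};
```

  In this arrangement the statistics live in CArray, so a set-level peak list
  would not inherit methods that are meaningless for it, and CTraceSet holds
  copies of its traces rather than pointers to shared records.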